subway_cleaned = read_csv("Data/cleaned_subway_data.csv")
# Importing NYC map
nyc_map = st_read(here::here('NYC', 'nyc.shp'), quiet = TRUE)
nycmap = st_transform(nyc_map, crs = 4326)
Here, we investigate various features and overall accessibility levels of subway stations to gain insight into how accessible NYC’s subway stations truly are.
subway_cleaned %>%
ggplot() +
geom_sf(
data = nyc_map, fill = NA
) +
geom_point(
aes(x = station_longitude, y = station_latitude, color = ada),
size = 2.5, alpha = 0.5) +
coord_sf() +
theme_void(base_size = 10) +
theme(legend.position = 'bottom') +
guides(color = guide_legend(
title.position = "top",
override.aes = list(size = 3))) +
scale_color_manual(values = c("FALSE" = "aquamarine3", "TRUE" = "slateblue3")) +
labs(color = "ADA Compliance")
ada compliance ensures that individuals with
disabilities can safely and easily access subway stations. In the
accompanying plot, each dot represents a station, and we can see that
NYC’s stations generally do not perform well in terms of ADA
compliance…
subway_cleaned %>%
ggplot() +
geom_sf(
data = nyc_map, fill = NA
) +
geom_point(
aes(x = station_longitude, y = station_latitude, color = free_crossover),
size = 2.5, alpha = 0.5) +
coord_sf() +
theme_void(base_size = 10) +
theme(legend.position = 'bottom') +
guides(color = guide_legend(
title.position = "top",
override.aes = list(size = 3))) +
labs(color = "Free Crossover")
As the most complex subway system in the world, however, the NYC Subway does provide free crossover points in most of its stations. With this understanding of both the weaknesses (ADA compliance) and strengths (free crossover points) in mind, we now turn our attention to evaluating the overall accessibility performance of the city’s subway stations.
# Select the Variables for Clustering
clustering_data <- subway_cleaned %>%
dplyr::select(entrance_type, staffing, ada, free_crossover, station_latitude, station_longitude, station_name)
set.seed(123) # For reproducibility
km_result <- clustering_data %>%
dplyr::select(-station_latitude, -station_longitude, -station_name) %>%
kmodes(modes = 3)
We cluster the stations into three groups (levels) of overall accessibility based on their internal accessibility features.
clustering_data$.cluster <- factor(km_result$cluster)
scenario_counts <- clustering_data %>%
mutate(
scenario = case_when(
ada == "TRUE" & free_crossover == "TRUE" ~ "ADA=TRUE, Free=TRUE",
ada == "FALSE" & free_crossover == "FALSE" ~ "ADA=FALSE, Free=FALSE",
ada == "FALSE" & free_crossover == "TRUE" ~ "ADA=FALSE, Free=TRUE",
TRUE ~ NA_character_
)
) %>%
filter(!is.na(scenario)) %>%
group_by(scenario, .cluster) %>%
summarise(n = n(), .groups = "drop")
plot_ly(
data = scenario_counts,
x = ~scenario,
y = ~n,
color = ~.cluster,
type = "bar"
) %>%
layout(
title = "Count of Stations by Scenario and Cluster",
xaxis = list(title = "Scenario"),
yaxis = list(title = "Count of Stations"),
barmode = "group",
bargap = 0.2
)
From the plot, we can see that stations with the highest
accessibility levels are less likely to have both ADA compliance and
free_crossover points, while the lowest accessibility
stations are more likely to have neither. In scenarios where stations
only offer free_crossover, the more accessible a station is
overall, the less likely it is to only have that one feature. However,
most of the stations only meet the basic amenity standard of providing
free crossover.
# Define a Mode function
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
cluster_profiles <- clustering_data %>%
group_by(.cluster) %>%
dplyr::select(-station_latitude, -station_longitude) %>%
dplyr::select(station_name, everything()) %>%
summarise(across(everything(), ~ Mode(.x)))
knitr::kable(cluster_profiles)
| .cluster | station_name | entrance_type | staffing | ada | free_crossover |
|---|---|---|---|---|---|
| 1 | 59th St | Stair | FULL | FALSE | TRUE |
| 2 | 25th St | Stair | FULL | FALSE | TRUE |
| 3 | 36th St | Stair | FULL | FALSE | TRUE |
As we display the results of our model, all dominant features tend to appear similar across the clusters. This outcome likely occurs due to the uneven distribution of certain categorical variables, as reflected in our visualization section.
clustering_data <- clustering_data %>%
mutate(
accessibility_level = case_when(
.cluster == 1 ~ "High Accessibility",
.cluster == 2 ~ "Medium Accessibility",
.cluster == 3 ~ "Low Accessibility"
),
accessibility_level = factor(accessibility_level,
levels = c("High Accessibility", "Medium Accessibility", "Low Accessibility"))
)
pal <- leaflet::colorFactor(
palette = c("chartreuse", "darkgoldenrod1", "brown2"), # Adjust colors as needed
domain = clustering_data$accessibility_level
)
leaflet() |>
addTiles() |>
addCircleMarkers(data = clustering_data,
lng = ~station_longitude,
lat = ~station_latitude,
label = ~station_name,
radius = 4,
color = NA,
# color = ~pal(accessibility_level),
fillColor = ~pal(accessibility_level),
stroke = TRUE, fillOpacity = 0.75,
popup = ~paste("Ada:", ada,
"<br> Staffing:", staffing,
"<br> Entrance type:", entrance_type,
"<br> Free crossover:", free_crossover)) |>
addProviderTiles(providers$CartoDB.Positron) |>
addLegend(
"bottomright",
pal = pal,
values = clustering_data$accessibility_level,
title = "Accessibility Level",
opacity = 1
)
Next, we examine where high- and low-accessibility stations are located in the city. It appears that areas with a high concentration of low-accessibility stations include Downtown Manhattan, the Bronx, and Downtown Brooklyn.
subway_with_restroom = read_csv("Data/cleaned_subway_restroom_data.csv")
subway_with_restroom <- subway_with_restroom %>%
mutate(
# #convert to logical
restroom_changing_stations_logic = as.logical(restroom_changing_stations),
restroom_status_logic = as.logical(restroom_status),
restroom_accessibility = fct_explicit_na(restroom_accessibility, na_level = "Unknown"),
restroom_open = fct_explicit_na(restroom_open, na_level = "Unknown")
)
We import data on subway stations along with information on their
closest accessible restrooms. We then adjust the dataset to handle
missing values and ensure that variables
(restroom_changing_stations, restroom_status,
restroom_accessibility, restroom_open) are
suitable for k-modes clustering.
# Select the Variables for Clustering
clustering_merged_data <- subway_with_restroom %>%
dplyr::select(entrance_type, staffing, ada, free_crossover, station_latitude, station_longitude, station_name,
restroom_open, restroom_accessibility,restroom_changing_stations_logic, restroom_status_logic)
set.seed(123) # For reproducibility
km_result2 <- clustering_merged_data %>%
dplyr::select(-station_latitude, -station_longitude, -station_name) %>%
klaR::kmodes(modes = 3)
clustering_merged_data$.cluster <- factor(km_result2$cluster)
# Define a Mode function
Mode_for_merged <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
cluster_merged_profiles <- clustering_merged_data %>%
group_by(.cluster) %>%
dplyr::select(-station_latitude, -station_longitude) %>%
dplyr::select(station_name, everything()) %>%
summarise(across(everything(), ~ Mode_for_merged(.x)))
knitr::kable(cluster_merged_profiles)
| .cluster | station_name | entrance_type | staffing | ada | free_crossover | restroom_open | restroom_accessibility | restroom_changing_stations_logic | restroom_status_logic |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 36th St | Stair | FULL | FALSE | TRUE | Year Round | Fully Accessible | FALSE | TRUE |
| 2 | 45th St | Stair | FULL | FALSE | TRUE | Year Round | Unknown | FALSE | TRUE |
| 3 | 25th St | Stair | FULL | FALSE | FALSE | Year Round | Not Accessible | FALSE | TRUE |
From the table of our model fitting result, we speculate that the
restroom_accessibility has significant effect on the
cluster result.
clustering_merged_data %>%
group_by(.cluster, restroom_accessibility) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(.cluster, restroom_accessibility) %>%
knitr::kable()
| .cluster | restroom_accessibility | n |
|---|---|---|
| 1 | Fully Accessible | 176 |
| 1 | Not Accessible | 14 |
| 1 | Partially Accessible | 8 |
| 2 | Not Accessible | 13 |
| 2 | Partially Accessible | 6 |
| 2 | Unknown | 77 |
| 3 | Fully Accessible | 19 |
| 3 | Not Accessible | 39 |
| 3 | Partially Accessible | 2 |
| 3 | Unknown | 2 |
From the model’s results, we speculate that restroom_accessibility
has a significant effect on the clustering outcomes. Most
high-accessibility stations have fully accessible restrooms nearby,
while medium- and low-accessibility stations often lack accessible
restrooms or have missing data for those amenities. This finding
supports our speculation that restroom_accessibility is a
dominant factor in the model.
clustering_merged_data <- clustering_merged_data %>%
mutate(
accessibility_level = case_when(
.cluster == 1 ~ "High Accessibility",
.cluster == 2 ~ "Medium Accessibility",
.cluster == 3 ~ "Low Accessibility"
)
)
pal <- leaflet::colorFactor(
palette = c("chartreuse", "darkgoldenrod1", "brown2"), # Adjust colors as needed
domain = clustering_merged_data$accessibility_level
)
leaflet() |>
addTiles() |>
addCircleMarkers(data = clustering_merged_data,
lng = ~station_longitude,
lat = ~station_latitude,
label = ~station_name,
radius = 4,
color = NA,
# color = ~pal(accessibility_level),
fillColor = ~pal(accessibility_level),
stroke = TRUE, fillOpacity = 0.75,
popup = ~paste("Ada:", ada,
"<br> Staffing:", staffing,
"<br> Entrance type:", entrance_type,
"<br> Free crossover:", free_crossover)) |>
addProviderTiles(providers$CartoDB.Positron) |>
addLegend(
"bottomright",
pal = pal,
values = clustering_merged_data$accessibility_level,
title = "Accessibility Level",
opacity = 1
)
The distribution plot also aligns with our earlier speculation.
There are large proportion of variables in
subway_clean are un-evenly distributed
In our analysis, we only include the restroom that is closest to each stations
One potential solution is to balance or reweight the data to reduce the impact of uneven distributions and incorporate multiple nearby restrooms into the analysis to gain a more comprehensive understanding of accessibility.